Automatic readability classifier for European Portuguese
Abstract
This paper describes a system that automatically classifies the readability of European Portuguese texts, and highlights the key challenges in linguistic feature selection and text classification. To this end, the system uses existing Natural Language Processing (NLP) tools to extract linguistic features from texts, which then feed an automatic readability classifier. Currently, the system extracts 52 features grouped in 7 groups: parts-of-speech (POS), syllables, words, chunks and phrases, averages and frequencies, and some extra features. The classifier was built from these features and a corpus previously annotated by readability level according to an official five-level language classification standard for Portuguese as a Second Language. In the five-level (A1 to C1) and three-level (A, B, and C) scenarios, the best-performing learning algorithm (LogitBoost) yields 79.25% and 86.32%, respectively.
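The relationship between the two label schemes, and the general idea of turning a text into numeric features, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `collapse_level` and `surface_features` are hypothetical, the three-level scheme is assumed to merge sublevels by their letter, and the four surface features are simple stand-ins, not the 52 features or the NLP tools used by the system.

```python
import re

# Five-level scale mentioned in the abstract (A1 to C1).
FIVE_LEVELS = ["A1", "A2", "B1", "B2", "C1"]


def collapse_level(level: str) -> str:
    """Map a five-level label (e.g. 'B2') to a three-level band ('B').

    Assumption: the three-level scheme merges sublevels by letter.
    """
    if level not in FIVE_LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    return level[0]


def surface_features(text: str) -> dict:
    """Compute a few illustrative surface features (not the paper's 52)."""
    # Naive sentence split: ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    # In Python 3, \w matches accented Portuguese letters by default.
    words = re.findall(r"\w+", text)
    n_sent = len(sentences)
    n_words = len(words)
    return {
        "sentences": n_sent,
        "words": n_words,
        "avg_words_per_sentence": n_words / n_sent if n_sent else 0.0,
        "avg_word_length": (
            sum(len(w) for w in words) / n_words if n_words else 0.0
        ),
    }
```

A feature dictionary of this kind, paired with an annotated level, is the sort of input a supervised learner would consume; the choice of learner and the real feature inventory are left to the paper itself.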
Similar resources
Automatic Text Difficulty Classifier - Assisting the Selection Of Adequate Reading Materials For European Portuguese Teaching
This paper describes a system to assist the selection of adequate reading materials to support European Portuguese teaching, especially as second language, while highlighting the key challenges on the selection of linguistic features for text difficulty (readability) classification. The system uses existing Natural Language Processing (NLP) tools to extract linguistic features from texts, which...
Automatic Construction of Large Readability Corpora
This work presents a framework for the automatic construction of large Web corpora classified by readability level. We compare different Machine Learning classifiers for the task of readability assessment focusing on Portuguese and English texts, analysing the impact of variables like the feature inventory used in the resulting corpus. In a comparison between shallow and deeper features, the fo...
Automatic Readability Classification of Crowd-Sourced Data based on Linguistic and Information-Theoretic Features
This paper presents a classifier of text readability based on information-theoretic features. The classifier was developed based on a linguistic approach to readability that explores lexical, syntactic and semantic features. For this evaluation we extracted a corpus of 645 articles from Wikipedia together with their quality judgments. We show that information-theoretic features perform as well ...
SIMPLIFICA: a tool for authoring simplified texts in Brazilian Portuguese guided by readability assessments
SIMPLIFICA is an authoring tool for producing simplified texts in Portuguese. It provides functionalities for lexical and syntactic simplification and for readability assessment. This tool is the first of its kind for Portuguese; it brings innovative aspects for simplification tools in general, since the authoring process is guided by readability assessment based on the levels of literacy of th...
Automatic identification of language varieties: The case of Portuguese
Automatic Language Identification of written texts is a well-established area of research in Computational Linguistics. State-of-the-art algorithms often rely on n-gram character models to identify the correct language of texts, with good results seen for European languages. In this paper we propose the use of a character n-gram model and a word n-gram language model for the automatic classifica...